1 Introduction

1.1 Motivation 

Recently, the online game market has been growing remarkably fast, and so has the
importance of good game design. In online games, a human user usually competes with others,
so the fairness of the game system to all users is essential to keep users interested in
the game. Furthermore, the emergence and success of electronic sports (e-sports) and
professional gaming, in which specially talented gamers compete with each other, draws more attention
to whether they are competing in a fair environment.

No matter how fierce the debates are in the game-design community, statistical analysis is rarely
employed to answer this question seriously. But considering that
a large amount of user behavior data can easily be gathered from games, it seems potentially beneficial
to use this data to aid design decisions. In fact, modern
games do not aim to design the game perfectly at once: rather, they first release the game, and
then monitor users' behavior to better balance the game. In such a scenario, statistical analysis
can be particularly helpful.

Specifically, we chose to analyze the balance of StarCraft II™, a very successful,
recently released real-time strategy (RTS) game. It is a central icon in the current e-sports and
professional gaming community: from April 1st to 15th, there were 18 tournaments of StarCraft
II™. However, there is endless debate on whether the winner of a tournament is actually
superior to the others, or whether the result is largely due to certain design flaws of the game. In this paper, we aim
to answer this question using a traditional statistical tool, logistic regression.

1.2 Problem Setting 

In a 1 vs. 1 match of this game, each gamer chooses a race of army to play. There are three
races: Terran, Protoss, and Zerg. Note that two gamers are allowed to choose the same
race. Also, the map* is chosen, usually according to the rules of the tournament. Once the races
of the two players and the map are chosen, the two gamers wage a war until one gamer gives up and
admits defeat, or certain end-of-game conditions are met.

* Actually, it is more accurate to call it the battleground of the war, but it is conventionally called a map.

In traditional games like chess or go, the two players are in exactly the same condition except
for the right of the first move. In games like StarCraft II, the gamer can choose very important
characteristics of his/her army, so whether the game is fair is an important issue in
the community. In particular, people are extremely interested in whether playing a certain race
is particularly advantageous against another. For example, many people argue that it is
difficult for a Zerg player to defeat a Terran player.

However, note that the balance between two races also depends heavily on the map they
are playing on. For example, there are maps in which the bases of the two players are located close together. It
is generally believed that such a placement of bases favors the Terran race, because the Terran
army is powerful but immobile relative to the others. But there are numerous other factors of
map design that map designers can use to make the game balanced, and we are usually
interested in the overall effect of such factors.

1.3 Data Description 

In this research, we used the results of 852 games in the Global Starcraft League (GSL)* from
October 2010 to March 2011. In this most prestigious league of StarCraft II™, 64
professional gamers compete against each other, and the winner of the league gets $87,000.

Each record consists of the identifiers of the two players who played
each other, their corresponding choices of race, the map on which the game was played, the date of
the match, and the duration of the match: see Table 1. Players rarely change their race between
games, since it requires a lot of practice to be good at even a single race. Thus, one may view
the two variables as two levels of hierarchical information: race provides higher-level information
and player provides lower-level information. There was only one gamer who played a randomly
chosen race, for just two games until being eliminated from the tournament, and his games were
omitted from the data.

There are 136 players and 14 maps in this data. The data was gathered from the official
website of GSL† using a Python-based web crawler we wrote ourselves.



Variable   Type             Description                                Example
---------  ---------------  -----------------------------------------  ----------------
Winner     Binary (0 or 1)  1 if Player 1 won the game, otherwise 0    0, 1
Player 1   Nominal          Identity of one player of the game         Jonathan Walsh
Race 1     Nominal          The race of Player 1                       Protoss
Player 2   Nominal          Identity of the other player of the game   Greg Fields
Race 2     Nominal          The race of Player 2                       Zerg
Map        Nominal          The map on which the game was played       Xel'Naga Caverns
Date       Interval         The date of the match                      Jan. 01, 2011
Duration   Interval         The duration of the match                  21 min 35 sec

Table 1: List of Variables in the Data



* http://wiki.teamliquid.net/starcraft2/GOMTV_Global_Starcraft_II_League
† http://esports.gomtv.com/gsl/



Technical Report (2011), 






2 Methods 



2.1 Specification of Model 

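Recall from Table 1 that $\text{Winner}_i = 1$ if Player 1 wins game $i$. We model each $\text{Winner}_i$ as
an independent Bernoulli random variable with $\pi_i := P(\text{Winner}_i = 1)$ given by:

$$\operatorname{logit}(\pi_i) := \log \frac{\pi_i}{1 - \pi_i} = \beta_{\text{Player1}_i} - \beta_{\text{Player2}_i} + \beta_{(\text{Map},\text{Race1},\text{Race2})_i}. \tag{1}$$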



Let us try to understand the intuition behind the model using an example with a figure.
Suppose that the famous gamers Greg Fields and Jonathan Walsh are playing on the map Xel'Naga
Caverns. Please refer to equation (1) and Figure 1.



[Figure 1 content: $\beta_{\text{Greg Fields}} = 2.098$, $-\beta_{\text{Jonathan Walsh}} = -1.797$, $\beta_{(\text{Xel'Naga Caverns},\text{Zerg},\text{Terran})} = 0.0632$; $0.36336 \approx 2.098 - 1.797 + 0.0632$]

Figure 1: Illustration of the Model

If Player 1 is a great gamer, then $\beta_{\text{Player1}}$ will be high, which increases $\pi$ and consequently
the probability that he will win the game. Greg Fields is one of the greatest Zerg
players in the world, so his estimated $\beta$ is very high: $\beta_{\text{Greg Fields}} = 2.098$.

On the other hand, no matter how good Player 1 is, if his opponent, Player 2, is also
strong, then the probability that he will win the game should certainly decrease. This is
accounted for by $\beta_{\text{Player2}}$. The opponent Jonathan Walsh is also one of the greatest Terran players,
and his estimated parameter is $\beta_{\text{Jonathan Walsh}} = 1.797$. This value is subtracted from $\beta_{\text{Greg Fields}}$,
decreasing Greg's probability of winning the game.

However, we should also consider the map the two are playing on. It is Xel'Naga Caverns, and it turns
out that this map favors Zerg slightly over Terran. Thus, $\beta_{(\text{Xel'Naga Caverns},\text{Zerg},\text{Terran})} = 0.0632$,
increasing the probability that Greg wins on this map.

Finally, since $\operatorname{logit}(\pi_i) = 0.36336$, we may invert the logit function to get

$$P(\text{Greg wins}) := \pi_i = \frac{1}{1 + \exp(-0.36336)} \approx 0.516, \tag{2}$$



using the standard transformation used in logistic regression. Since both are
very good players, it turns out to be very hard to predict who will win the game. This is
sensible, since the game is by nature not very predictable: if the result could be easily predicted,
nobody would spend time watching the game. Using statistical analysis,
however, we can capture the overall tendency in games, even though the signal may not be very
strong. In this case, the fitted model says that Greg is more likely to win than Jonathan.

Note that we are assuming the games to be conditionally independent, given the players and
the map of each game. The possible problems with this assumption will be discussed with other limits
of the model in Section 2.3.2.
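The worked example can be traced in a few lines of Python. This is a minimal sketch, not the code used for the report: the function names `inv_logit` and `win_probability` are hypothetical, and the inputs are the rounded estimates quoted in the text.

```python
import math

def inv_logit(x: float) -> float:
    """Map a log-odds value to a probability (inverse of the logit)."""
    return 1.0 / (1.0 + math.exp(-x))

def win_probability(beta_p1: float, beta_p2: float, beta_map_races: float) -> float:
    """P(Player 1 wins): Player 1 strength minus Player 2 strength plus map/race term."""
    return inv_logit(beta_p1 - beta_p2 + beta_map_races)

# Rounded estimates quoted in the text (Greg = Player 1, Jonathan = Player 2).
p_greg = win_probability(2.098, 1.797, 0.0632)
# Swapping the players flips the sign of the map/race term as well.
p_jonathan = win_probability(1.797, 2.098, -0.0632)

print(f"P(Greg wins)     = {p_greg:.3f}")
print(f"P(Jonathan wins) = {p_jonathan:.3f}")
```

With these rounded inputs the sketch gives about 0.590 for Greg. The value 0.516 is what the same function returns for a log-odds of 0.0632 alone, so the probability quoted in the text may have been computed from different inputs than the ones displayed.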



2.2 Restrictions on Parameter Space 

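Note that when the positions of Player 1 and Player 2 are switched in equation (1), the model
should still give an equivalent result. In the previous example of Greg and Jonathan, since
$P(\text{Greg wins}) = 0.516$, $P(\text{Jonathan wins})$ should be $1 - 0.516 = 0.484$. To impose such a
restriction, we need the following conditions: for each map, we should have

$$\beta_{\text{Map},\text{Terran},\text{Protoss}} = -\beta_{\text{Map},\text{Protoss},\text{Terran}}, \tag{3}$$
$$\beta_{\text{Map},\text{Terran},\text{Zerg}} = -\beta_{\text{Map},\text{Zerg},\text{Terran}}, \tag{4}$$
$$\beta_{\text{Map},\text{Protoss},\text{Zerg}} = -\beta_{\text{Map},\text{Zerg},\text{Protoss}}. \tag{5}$$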

Such a restriction can be imposed naturally within the framework of standard logistic regression,
without making the inference step any harder. One has to use a carefully designed data matrix, but the
details are omitted since they are fairly straightforward.
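Since the construction of the data matrix is omitted, the following is only one possible encoding that satisfies the antisymmetry restriction, written as a sketch under stated assumptions. The helpers `build_columns` and `design_row` are hypothetical names; the sketch keeps one free parameter per unordered race pair and map, writes +1 or -1 depending on which side Player 1 is on, and assumes the model is fit without an intercept.

```python
from typing import Dict, List, Tuple

RACES = ("Protoss", "Terran", "Zerg")  # alphabetical, used as the canonical order

def build_columns(players: List[str], maps: List[str]) -> Dict[Tuple[str, ...], int]:
    """Give every free parameter a column: one per player, one per (map, race pair)."""
    cols: Dict[Tuple[str, ...], int] = {}
    for name in players:
        cols[("player", name)] = len(cols)
    for m in maps:
        for i, r1 in enumerate(RACES):
            for r2 in RACES[i + 1:]:
                cols[("map", m, r1, r2)] = len(cols)
    return cols

def design_row(cols, p1: str, r1: str, p2: str, r2: str, game_map: str) -> List[float]:
    """Row of the data matrix for one game, seen from Player 1's side."""
    row = [0.0] * len(cols)
    row[cols[("player", p1)]] += 1.0   # +beta_Player1
    row[cols[("player", p2)]] -= 1.0   # -beta_Player2
    if r1 != r2:                       # mirror matches get no map/race term
        a, b = sorted((r1, r2))        # canonical (alphabetical) pair
        sign = 1.0 if r1 == a else -1.0
        row[cols[("map", game_map, a, b)]] += sign
    return row

cols = build_columns(["Greg Fields", "Jonathan Walsh"], ["Xel'Naga Caverns"])
row = design_row(cols, "Greg Fields", "Zerg", "Jonathan Walsh", "Terran", "Xel'Naga Caverns")
swapped = design_row(cols, "Jonathan Walsh", "Terran", "Greg Fields", "Zerg", "Xel'Naga Caverns")
```

Swapping the two players negates every entry of the row, so the linear predictor changes sign and the two predicted win probabilities sum to one.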



2.3 Characteristics of the Model 

2.3.1 Advantages over Traditional Approach 

Let us discuss why we need such a statistical model to analyze this data. To compare the
performance of players, the traditional way of analyzing game results is to calculate
the win rate of individual players. For example, one may calculate the fraction of games Greg
won and the fraction of games Jonathan won, and compare the two numbers. But this method
is obviously problematic, since such a calculation does not take into account how strong one's
opponents have been. If a gamer has only encountered weak gamers by chance, and has not
yet been challenged by strong ones, would you still think he is a good gamer only because
he has a good win rate? Certainly not. The beauty of having a statistical model is that it
takes care of this. For example, Greg won only 41.67% of his games, while Jonathan won
50%. But the fitted model does not say that Jonathan is the better player: accounting for their respective
match histories, it says that Greg is generally the better player, with parameter values
$\beta_{\text{Greg Fields}} = 2.098$ and $\beta_{\text{Jonathan Walsh}} = 1.797$.

On the other hand, whether a certain map is balanced is always hotly debated. But
this is a hard question to answer: one cannot simply say that the map Xel'Naga
Caverns favors Zerg just because many Zerg players are beating Terran players on this map. Maybe we have
not seen good enough Terran players on this map, or we have not seen enough games
on this map. When the fraction of games Zerg has won over Terran is calculated (for example,
0.6), what is its standard error? This is hard to answer, since the games are not marginally
independent of each other: we know that good gamers are more likely to win, while bad gamers
are less likely to do so. However, it is much more reasonable to assume that the games are conditionally
independent, which is our assumption, and in this case we can estimate the variability of our
estimates.

Finally, it is hard to combine estimates in the traditional approach. When Jonathan is playing
a game against Greg on Xel'Naga Caverns, how would you combine both gamers' win rates and the
fraction of games Zerg won over Terran on Xel'Naga Caverns? Usually, people stop being quantitative
and resort to a qualitative approach. In our model, we can quantitatively combine the $\beta$'s to calculate
the overall effect.

2.3.2 Limits 

No matter how attractive the model is compared to traditional approaches, it is by no
means a perfect model, for the following reasons:

• Constancy of Parameter Values over Time: The strength of each gamer is not constant
over time. As a gamer accumulates experience, he generally gets better and better. On
the other hand, it has frequently been observed that once-legendary gamers become ordinary
ones as they grow older and can no longer react as quickly as younger gamers. Thus using a
single $\beta$ parameter per player for every game is actually problematic. However, in this data such an
assumption was inevitable, since we have only observed seven months of games. When a study
on a longer time scale is done, we may even attempt to model this time-series effect.

• Conditional Independence: The games may not even be conditionally independent given
both players and the map. For example, when two gamers are playing a best-of-five match,
the result of the first game certainly affects the second. When a player loses the
first game, he may get discouraged, or, having already seen what style his
opponent played, he may adjust his own style and win the following game. But since
most of the games were played as league matches or as best-of-three matches, we assume
that such dependence between games is not very strong. Also, since we have a
large number of games (852), dependence within groups of three or five games may not play a very
significant role.

• Interaction Between Players: It is also well known that there should be a certain interaction
effect between players. For example, it is hard for a player with an aggressive style to win against
a very defensive gamer. However, a defensive gamer may have a hard time fighting against a
gamer who exploits the fact that his opponent is defensive and expands very quickly.
Since we have only 852 games and there are 136 gamers in the data, it is not possible to
estimate 136 x 136 = 18496 parameters from 852 games. However, we may
partially take this into account using mixed-membership or latent feature models. We
leave these interesting possibilities for future work.

• Interaction between Player and Map: Sometimes it is clearly seen that a certain player is very
good on certain kinds of maps. Although this kind of interaction is more tractable
than player-player interaction, consideration of such factors is left for
future work.

2.4 Model Parsimony

Although we have tried hard to keep our model simple, we still have too many parameters,
since each player is given one parameter. Since some gamers played only one or two games, only
to lose and be eliminated from the tournament, modeling even such gamers would result in
over-fitting and numerical instability. Such a problem can generally be taken
care of by using a regularized estimation approach, but it is slightly out of the scope of the course
and we lose the notion of probability there.

Instead, one may do the variable selection by hand, without relying on indirect
regularized estimation. In general data analysis problems this is hard, since it is hard
to consider every combination of variables. However, it is not that difficult when one has a good
idea of which variables are unnecessary, and that turns out to be the case here: we know
that the players with a small number of games are problematic. Thus, we fix the $\beta$'s of such players
to be 0. That is, the model gives up estimating the performance of gamers who have not yet
played enough, since we do not have enough data. However, those gamers are collectively taken
into account, since they do affect the estimation of the performance of other players and of the balance
of the maps. This also has a very natural interpretation: when two players without enough data
are playing against each other, the reasonable prediction is that it is a 50-50 game. When we
know what map they are playing on, we may use the overall trend on the map to predict
the result, without being able to use any further information. This turns out to give results similar
to those of $L_1$ regularization (lasso), which will be discussed in the next section.*
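The selection rule described above can be sketched as follows. The report does not state the cutoff on the number of games, so `min_games` is left as a parameter, and the function name `players_with_enough_games` is hypothetical; players outside the returned set would have their $\beta$ fixed to 0.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

def players_with_enough_games(games: Iterable[Tuple[str, str]], min_games: int) -> Set[str]:
    """Return the players who appear in at least `min_games` games.

    Each game is a (player1, player2) pair; both participants are counted.
    """
    counts: Counter = Counter()
    for player1, player2 in games:
        counts[player1] += 1
        counts[player2] += 1
    return {name for name, n in counts.items() if n >= min_games}

# Toy data with made-up player labels, only to show the call.
games = [("A", "B"), ("A", "C"), ("B", "A"), ("D", "A")]
estimated = players_with_enough_games(games, min_games=2)
```

In the toy data only A and B reach the cutoff, so C and D would keep a parameter of 0 while still appearing as opponents in the games of A.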



3 Results 

3.1 Adequacy of the Model 

Firstly, lack of fit was tested. As a first step, the model was tested against the constant model ($\beta_j = 0$ for
all $j$), and the p-value was 10~®, naturally rejecting the null hypothesis that the constant model
suffices at almost any significance level. Since this is almost always the case for data of
considerable size, we also conducted the Hosmer-Lemeshow test† using 10 groups: the p-value was
0.153, again favoring our model.‡
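For readers unfamiliar with the Hosmer-Lemeshow test, the statistic can be sketched as below. This is not the R code referenced in the footnote: the function name is hypothetical, and the grouping rule (equal-sized groups after sorting by fitted probability) is one common convention. The statistic is usually compared to a chi-square distribution with (groups - 2) degrees of freedom; that step is left out because the Python standard library has no chi-square distribution function.

```python
from typing import Sequence

def hosmer_lemeshow_statistic(y: Sequence[int], p: Sequence[float], groups: int = 10) -> float:
    """Hosmer-Lemeshow statistic for binary outcomes y and fitted probabilities p.

    Observations are sorted by fitted probability and split into `groups`
    nearly equal-sized groups; observed and expected wins are compared per group.
    Assumes len(y) >= groups and group-mean probabilities strictly inside (0, 1).
    """
    order = sorted(range(len(y)), key=lambda i: p[i])
    n = len(order)
    stat = 0.0
    for g in range(groups):
        idx = order[g * n // groups:(g + 1) * n // groups]
        n_g = len(idx)
        observed = sum(y[i] for i in idx)
        expected = sum(p[i] for i in idx)
        mean_p = expected / n_g
        stat += (observed - expected) ** 2 / (n_g * mean_p * (1.0 - mean_p))
    return stat
```

A small statistic (large p-value) means the observed win counts in each group are close to what the fitted probabilities predict, which is why a p-value of 0.153 is read as favoring the model.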

Secondly, the more modern method of cross-validation was used to evaluate the quality of the fit.
We conducted 10-fold cross-validation and evaluated the accuracy of our predictor on both training
and test data. Note that in our case, the losses for type I and type II prediction errors are the same
(symmetric loss), so it suffices to check accuracy, unlike in the general case. The average accuracy
on training data was 0.727 ± 0.00797 (mean ± standard deviation), while that on test data was
0.706 ± 0.0632. Our model seems to generalize quite well (a drop of only about 2 percentage points in accuracy), but the standard
deviation on test data is a bit high. The reason is probably that we do not have enough data about every
player: for some players with a small number of games, the training data may not contain enough
information about them, making their estimates inaccurate, although on average the
model works quite well. To see whether overfitting is problematic here, we also used an $L_1$
penalty (lasso) to estimate the parameters and evaluated its accuracy on exactly the same partitions of
the data. The accuracy was 0.708 ± 0.0676, which is not significantly different from that of the model using
no penalty. We did not fix any parameter to 0 for lasso, but the set of nonzero parameters
chosen by lasso using another 10-fold cross-validation on the training set was very similar to what
we obtained by using the number of games each gamer played: lasso also removed 94% of the players
we removed. In conclusion, overfitting was not a big problem, and our selection of variables was
not as ad hoc as it may have sounded.
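The cross-validation procedure can be sketched as follows. The sketch is generic rather than a reproduction of the original analysis: `fit` and `predict` are placeholders for any model-fitting and prediction functions supplied by the caller, each game is assumed to be a dict with a `"winner"` key, and the sample standard deviation is used because the report does not say which variant it reports.

```python
import random
import statistics
from typing import Callable, List, Sequence, Tuple

def k_fold_indices(n: int, k: int = 10, seed: int = 0) -> List[List[int]]:
    """Shuffle 0..n-1 and deal the indices into k folds of nearly equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validated_accuracy(games: Sequence[dict], fit: Callable, predict: Callable,
                             k: int = 10, seed: int = 0) -> Tuple[float, float]:
    """Mean and sample standard deviation of held-out accuracy over k folds."""
    accuracies = []
    for test_idx in k_fold_indices(len(games), k, seed):
        held_out = set(test_idx)
        train = [g for i, g in enumerate(games) if i not in held_out]
        test = [games[i] for i in test_idx]
        model = fit(train)                      # fit on the other k-1 folds
        correct = sum(predict(model, g) == g["winner"] for g in test)
        accuracies.append(correct / len(test))
    return statistics.mean(accuracies), statistics.stdev(accuracies)
```

Reusing the same `seed` for two models evaluates them on exactly the same partitions, which is what makes the comparison between the unpenalized fit and the lasso fit meaningful.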

* Note: the model is unidentifiable by itself, but by setting the parameters of some players to 0, it becomes
identifiable. It is identifiable without such a treatment in the lasso case.

† The code in http://www.stat.purdue.edu/~ovitek/STAT526-Spring11_files/4-logistic.R was used.
‡ In this case, the null hypothesis is our model.
Figure 2: Residual vs. Fitted Plot



Thirdly, we visually checked the quality of the fit. See Figure 2. The plot is not as flat as
desired. This is because our fit is not perfect: sometimes we predict that a player will win with
high probability, which does not turn out to be the case. Although we did not suffer from overfitting
in terms of accuracy, the lack of regularization may result in overfitting of the estimated probabilities,
so that the model is sometimes overconfident when it should not be. Such overfitting may naturally
occur when dealing with mid-sized data like this. The use of regularization does not really help,
since in that case we lose the notion of probability.

Lastly, we checked whether our conditional independence assumption was adequate. When a
quasi-binomial model was fit, the dispersion parameter was only 1.158738. To estimate the
distribution of the estimated dispersion parameter, we drew 1000 bootstrap datasets. The mean
and standard deviation of the estimates were 1.275 ± 0.284, indicating that there is some
over-dispersion, but that its magnitude is not very serious.
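The dispersion parameter of a quasi-binomial fit is commonly estimated as the Pearson chi-square statistic divided by the residual degrees of freedom. The sketch below assumes that estimator; the function name `pearson_dispersion` is hypothetical, and the fitted probabilities are taken as given rather than re-estimated.

```python
from typing import Sequence

def pearson_dispersion(y: Sequence[int], p: Sequence[float], n_params: int) -> float:
    """Pearson chi-square divided by residual degrees of freedom.

    y: observed binary outcomes, p: fitted win probabilities (strictly in (0, 1)),
    n_params: number of free parameters in the fitted model.
    """
    chi_square = sum((yi - pi) ** 2 / (pi * (1.0 - pi)) for yi, pi in zip(y, p))
    return chi_square / (len(y) - n_params)
```

A value near 1 is what the Bernoulli model predicts; values well above 1 indicate over-dispersion, which here would point to dependence between games that the model does not capture.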

3.2 Interpretation 

Recall that each player is given one parameter, which measures performance relative
to others. Figure 3 (1) plots the estimated parameter for each player: the distribution is naturally
centered at a point greater than zero, implying that players with enough data are better than
those players whose parameters were set to zero because they did not play enough games.
Readers interested in the ranks of the parameters may refer to Table 6.

Figure 3 (2) to (4) display the parameters that estimate the balance between two races for each
map. Since there are only 14 maps, the histograms are very spiky. The mean and
standard deviation of the estimates for the Terran vs. Protoss, Terran vs. Zerg, and Protoss vs.
Zerg balance of the maps were respectively 1.064 ± 0.821, 0.749 ± 0.566, and -0.369 ± 0.596. It seems
that the balance depends on the map, but most maps favor Terran over Protoss and Zerg, while
the balance between Protoss and Zerg seems better than the others.

To answer the higher-level question of "So, is the game well-balanced?", we need to average
over maps, since each map already accounts for balance individually. Note that this is similar
to testing a hypothesis about the overall mean in the cell-means model of one-way ANOVA. The
parameter we test is:

$$\beta_{\text{Race1},\text{Race2}} = \frac{1}{m} \sum_{i=1}^{m} \beta_{\text{Map}_i,\text{Race1},\text{Race2}}, \tag{6}$$

for each Race1, Race2, where $m$ is the number of maps. Since logistic regression does not have closed-form parameter
distributions, we drew 10000 bootstrap datasets and estimated (6) for each race combination.

Figure 3: Estimated Parameters
From left: (1) Estimated parameter for each player, (2) Estimated parameters for Terran vs. Protoss on
each map, (3) Same plot for Terran vs. Zerg, (4) Same plot for Protoss vs. Zerg.

Figure 4: Bootstrap distributions of parameter estimates (number of bootstrap samples: 10000)

As a result, $P(\beta_{\text{Terran},\text{Protoss}} > 0) \approx 0.839$, $P(\beta_{\text{Terran},\text{Zerg}} > 0) \approx 0.948$, and $P(\beta_{\text{Protoss},\text{Zerg}} > 0) \approx 0.290$, which implies that the imbalance between races is not significant at the 5%
significance level, even when each hypothesis that the parameter value is exactly zero is tested
individually (when multiple hypotheses are checked simultaneously, the significance level of each
test must be lowered further). However, there were certainly indications that there may be some balance problems,
especially in the case of Terran vs. Zerg. Interested readers may refer to the histograms of the estimates in
Figure 4.
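The bootstrap summary can be sketched as follows. The model refitting step is deliberately left out: the sketch assumes that each bootstrap replicate has already produced one map-level coefficient per map for a given race pair. All function names are hypothetical, and the conversion to a two-sided p-value (twice the smaller tail) is a common convention assumed here, not something stated in the report.

```python
import random
from typing import List, Sequence

def bootstrap_sample(games: Sequence, rng: random.Random) -> List:
    """Resample games with replacement, keeping the dataset size."""
    return rng.choices(games, k=len(games))

def averaged_balance(map_coefs: Sequence[float]) -> float:
    """Average of the per-map coefficients for one race pair, as in equation (6)."""
    return sum(map_coefs) / len(map_coefs)

def prob_positive(replicates: Sequence[Sequence[float]]) -> float:
    """Fraction of bootstrap replicates whose averaged balance is above zero."""
    means = [averaged_balance(rep) for rep in replicates]
    return sum(m > 0 for m in means) / len(means)

def two_sided_p(p_pos: float) -> float:
    """Two-sided p-value from the bootstrap probability of a positive estimate."""
    return 2.0 * min(p_pos, 1.0 - p_pos)
```

Under this convention the three reported probabilities correspond to two-sided p-values of about 0.32, 0.10 and 0.58, none of which falls below 0.05, in line with the conclusion drawn in the text.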

4 Discussion 

To the authors' knowledge, this is the first time that a standard statistical technique more
complex than mere summary statistics has been used to analyze user behavior data in online games.
Using our technique, game designers may make more careful use of the results gathered from beta testers
to reduce the cost of testing. In the past, especially with StarCraft I,
very unbalanced maps were sometimes used in tournaments, causing some strong players to
be eliminated early in the tournament. Our model would be very helpful in preventing such
a disaster.

Since we have already discussed many of the technical problems in the sections above, we conclude
this section briefly.
5 Appendix



5.1 Appendix A: Descriptive Statistics
5.1.1 Number of races 

There are three races in StarCraft II: Protoss, Terran, and Zerg. In the raw data, we
found three observations whose race is r. Only one player (ID : GuMihof Ou ) used the random option
(rare), which randomly assigns one of the three races. Since he had played only three games in total,
we eliminated these observations. After this elimination, we have 136 players. The bar graph
of the races of these 136 players is shown in Figure 5.




Figure 5: Number of Races 

There is a marked preference for Terran, which accounts for 42.6% of all players. At
this point, the question of balance between races can be raised. We want to analyze this balance problem
using a statistical approach.

5.1.2 Number of observations (games) per player 

There are 852 observations in the data set. Dividing the number of games by the number of players gives 6.26;
since each game involves two players, the average number of games played per player is about twice that, roughly 12.5.
However, better players advance further in a tournament and therefore play more games than others. In other words,
the number of games per player should vary widely. We observe this using
a histogram in Figure 6 and Table 2.



Games   Players   Games   Players
-----   -------   -----   -------
1-5     64        31-35   2
6-10    21        36-40   5
11-15   9         41-45   3
16-20   7         46-50   2
21-25   10        51-55   0
26-30   12        56-60   1

Table 2: Number of games of players



64 players (47%) played only 1-5 games. In particular, 38 players (27.9%) played at most
2 games. Statistical results based on such players may not be reliable. Hence we need to consider
data reduction. For example, a bar graph of the players who played more than 5 games is
shown in Figure 7.

5.1.3 Game frequencies between races 

Table 3 shows how many games were played for each race combination. Notice that the
combinations are unordered. For example, the frequency of Protoss vs. Terran covers both
(Player 1, Player 2) = (Protoss, Terran) and (Terran, Protoss).

For the purpose of balance analysis, we restrict our focus to games between different races.





Figure 6: Game frequencies (x-axis: number of games)

Figure 7: Number of players vs. Number of games (x-axis: number of games, in bins 6-10 through 56-60)
5.1.4 Game frequencies between different races

Table 4 shows the number of games between different races.

As an intuitive check, we can compare the win fractions, $\frac{121}{121+112} \approx 0.5193133$, $\frac{132}{132+133} \approx 0.4981132$, and $\frac{67}{67+64} \approx 0.5114504$. At a glance, these fractions do not look considerably different from 0.5.
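The three fractions can be recomputed directly from the win counts in Table 4. The snippet below is a small sketch of that arithmetic; the dictionary layout and the function name `win_fraction` are chosen here for illustration.

```python
# Win counts from Table 4: (wins of the first race, wins of the second race).
wins = {
    ("Terran", "Protoss"): (121, 112),
    ("Terran", "Zerg"): (132, 133),
    ("Protoss", "Zerg"): (67, 64),
}

def win_fraction(first_wins: int, second_wins: int) -> float:
    """Fraction of games won by the first race in a matchup."""
    return first_wins / (first_wins + second_wins)

fractions = {pair: win_fraction(*w) for pair, w in wins.items()}
for (first, second), frac in fractions.items():
    print(f"{first} vs. {second}: {frac:.7f}")
```

These raw fractions ignore who was playing and on which map, which is exactly the limitation the regression model is meant to address.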



5.1.5 Time trend of the number of players for each race 

Observing the trend of race proportions helps in understanding the balance problem. We divide the
data into 7 subsets by month, and count the number of players of each race. Each cell shows the
frequency, and the number in parentheses is the proportion of each race within
the period (row). Refer to Table 5 and Figure 8.



Races                 Frequency
-------------------   ---------
Protoss vs. Protoss   45
Protoss vs. Terran    233
Protoss vs. Zerg      131
Terran vs. Terran     134
Terran vs. Zerg       265
Zerg vs. Zerg         44

Table 3: Number of games for each race combination



Race vs. Race        Frequency   Number of Wins
------------------   ---------   -------------------------
Terran vs. Protoss   233         Terran: 121, Protoss: 112
Terran vs. Zerg      265         Terran: 132, Zerg: 133
Protoss vs. Zerg     131         Protoss: 67, Zerg: 64

Table 4: Game frequencies between different races



Period            Protoss Players   Terran Players   Zerg Players
---------------   ---------------   --------------   ------------
September, 2010   16 (0.37209)      17 (0.39534)     10 (0.23255)
October, 2010     20 (0.31746)      28 (0.44444)     15 (0.23809)
November, 2010    12 (0.19047)      25 (0.39682)     26 (0.41269)
December, 2010    4 (0.20000)       9 (0.45000)      7 (0.35000)
January, 2011     17 (0.26984)      28 (0.44444)     18 (0.28571)
February, 2011    17 (0.24285)      32 (0.45714)     21 (0.30000)
March, 2011       12 (0.28571)      19 (0.45238)     11 (0.26190)

Table 5: Time trend of the number of players for each race
5.2 Appendix B: Validation check with ranks

Table 6 shows the ranks up to March 19th, 2011. The first column is the rank estimated
by our model, and the third column is the rank in prize money. Prize ranks beyond 20 were not publicized and are thus not displayed.



Rank   Name            Rank in Prize Money (Korean Won)
----   -------------   --------------------------------
1      Min-Chul Jang   1
2      Yong-Hwa Choi   -
3      Kang-Ho Hwang   -
4      Jae-Duk Lim     2
5      Young-Jin Kim   -
6      Jun-Sik Yang    -
7      Sung-Jun Park   8
8      Jun-hyuk Song   15
9      Hyun-Woo Park   -
10     Won-Ki Kim      3

Table 6: Parameter Estimate Rank and Prize Money Rank (up to March 19, 2011).

Although this ranking is not very consistent with the prize ranking, as experts on the game we find
it very convincing. Some of the gamers ranked high here have only recently joined
the tournament, and have not had enough opportunities to win large prize money.
Figure 8: Proportions of races according to time